In today’s data-driven world, the ability to forecast accurately is crucial for strategic planning and decision-making. This project aims to harness the power of historical data to generate reliable forecasts for various series within a de-identified dataset. With a comprehensive dataset spanning 1622 periods, we will employ advanced data analysis and forecasting techniques to predict future trends for the next 140 periods.
This report is crafted to cater to a diverse audience, ranging from individuals with no background in data science to seasoned data scientists. Our goal is to present the analysis and findings in a clear, concise, and accessible manner. We will explain the methodologies used, the rationale behind their selection, and the step-by-step process of our analysis, ensuring transparency and comprehensibility for all readers.
The report is structured to balance technical rigor with simplicity, providing a thorough yet comprehensible narrative of our forecasting journey. We begin with an exploration of the data, followed by a detailed explanation of the forecasting methods applied, and conclude with the results and their implications. Visualizations and key insights will be highlighted to enhance understanding and readability.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
We read the Excel file from the project repository and extract the data:
## Response [https://raw.githubusercontent.com/waheeb123/Data-624/main/Projects/Data%20Set%20for%20Class.xls]
## Date: 2024-06-22 05:44
## Status: 200
## Content-Type: application/octet-stream
## Size: 1.33 MB
## <ON DISK> /var/folders/lt/23m_3s6j42vgpv1ss4d6nkgc0000gn/T//RtmpU0d2eF/file8ab3ae6ddb9.xls
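The download step above can be sketched as follows. This is a minimal reconstruction (the original chunk is not shown), assuming the `httr` and `readxl` packages are available:

```r
library(httr)
library(readxl)

url <- "https://raw.githubusercontent.com/waheeb123/Data-624/main/Projects/Data%20Set%20for%20Class.xls"

# Download the .xls to a temporary file, then read it with readxl,
# which cannot read directly from a URL
tmp <- tempfile(fileext = ".xls")
GET(url, write_disk(tmp, overwrite = TRUE))
data <- read_excel(tmp)
```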
Using `glimpse()`, we can view the attributes of our dataset:
## Rows: 10,572
## Columns: 7
## $ SeriesInd <dbl> 40669, 40669, 40669, 40669, 40669, 40669, 40670, 40670, 4067…
## $ category <chr> "S03", "S02", "S01", "S06", "S05", "S04", "S03", "S02", "S01…
## $ Var01 <dbl> 30.64286, 10.28000, 26.61000, 27.48000, 69.26000, 17.20000, …
## $ Var02 <dbl> 123432400, 60855800, 10369300, 39335700, 27809100, 16587400,…
## $ Var03 <dbl> 30.34000, 10.05000, 25.89000, 26.82000, 68.19000, 16.88000, …
## $ Var05 <dbl> 30.49000, 10.17000, 26.20000, 27.02000, 68.72000, 16.94000, …
## $ Var07 <dbl> 30.57286, 10.28000, 26.01000, 27.32000, 69.15000, 17.10000, …
The dataset has 10,572 rows and 7 columns. The first six rows are shown below:
| SeriesInd | category | Var01 | Var02 | Var03 | Var05 | Var07 |
|---|---|---|---|---|---|---|
| 40669 | S03 | 30.64286 | 123432400 | 30.34 | 30.49 | 30.57286 |
| 40669 | S02 | 10.28000 | 60855800 | 10.05 | 10.17 | 10.28000 |
| 40669 | S01 | 26.61000 | 10369300 | 25.89 | 26.20 | 26.01000 |
| 40669 | S06 | 27.48000 | 39335700 | 26.82 | 27.02 | 27.32000 |
| 40669 | S05 | 69.26000 | 27809100 | 68.19 | 68.72 | 69.15000 |
| 40669 | S04 | 17.20000 | 16587400 | 16.88 | 16.94 | 17.10000 |
First, we check whether there are any missing values in the dataframe:
## [1] TRUE
Next, we count the total number of missing values in the dataframe:
## [1] 4294
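To see where those 4,294 missing values sit, a per-column breakdown is helpful. A minimal sketch, assuming the combined dataset is stored in a data frame named `data` (the name is an assumption, not shown in the chunks above):

```r
# TRUE if any value anywhere in the dataframe is missing
any(is.na(data))

# Total number of missing values across all columns
sum(is.na(data))

# Missing values broken down by column
colSums(is.na(data))
```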
In order to forecast each variable within each group, we split the larger dataset into its individual groups so we could analyze and visualize each series properly.
We can see that variation within this type of variable is considerably less than in the other type, and there is no apparent seasonality. In many variables of this type we did see long trends (either upward or downward) and cyclicity. After reviewing the plots for every variable, we determined that further analysis was necessary to separate the trend and seasonal components of the data.
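One way to examine trend and seasonal components more formally is an STL decomposition. The sketch below is illustrative only: it assumes one of the group subsets (here `subset_S01`, with `Var01`) and a trial seasonal frequency of 5, e.g. a business-week cycle; the dataset itself does not establish any seasonal period, so the frequency is an assumption.

```r
# Decompose a series into trend, seasonal, and remainder components.
# frequency = 5 is only a trial value, not something the data confirms.
ts_example <- ts(subset_S01$Var01, frequency = 5)
decomposition <- stl(ts_example, s.window = "periodic", na.action = na.omit)
plot(decomposition)
```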
Many of the time series algorithms, both for forecasting and visualization, require that there be no missing values in the data. With this requirement, we deemed it necessary to fill nulls early in our data prep process.
In several of the series there are clear outliers that would strongly influence the models we built, so we filtered them using the interquartile range (IQR). The IQR is a measure of statistical dispersion, i.e. how spread out the values in a dataset are. The procedure keeps values that fall within 1.5 times the IQR below the first quartile (Q1) and 1.5 times the IQR above the third quartile (Q3); values outside this range are treated as outliers.
# Helper: keep only rows whose value of `var` lies within
# 1.5 * IQR of the first and third quartiles
filter_iqr <- function(df, var) {
  df %>%
    filter({{ var }} >= quantile({{ var }}, 0.25, na.rm = TRUE) - 1.5 * IQR({{ var }}, na.rm = TRUE),
           {{ var }} <= quantile({{ var }}, 0.75, na.rm = TRUE) + 1.5 * IQR({{ var }}, na.rm = TRUE))
}

# Filtering outliers for each subset
subset_S01_clean_Var01 <- filter_iqr(subset_S01, Var01)
subset_S01_clean_Var02 <- filter_iqr(subset_S01, Var02)
subset_S02_clean_Var02 <- filter_iqr(subset_S02, Var02)
subset_S02_clean_Var03 <- filter_iqr(subset_S02, Var03)
subset_S03_clean_Var05 <- filter_iqr(subset_S03, Var05)
subset_S03_clean_Var07 <- filter_iqr(subset_S03, Var07)
subset_S04_clean_Var01 <- filter_iqr(subset_S04, Var01)
subset_S04_clean_Var02 <- filter_iqr(subset_S04, Var02)
subset_S05_clean_Var02 <- filter_iqr(subset_S05, Var02)
subset_S05_clean_Var03 <- filter_iqr(subset_S05, Var03)
subset_S06_clean_Var05 <- filter_iqr(subset_S06, Var05)
subset_S06_clean_Var07 <- filter_iqr(subset_S06, Var07)

In data preparation, imputation of missing values is a critical step. Each of the variables provided had missing values, and several approaches to filling them seemed appropriate. The approach we took differed depending on whether we were imputing the first or the second type of data. For the data with large day-to-day variation (which looked almost like white noise), we deemed taking an average appropriate. For the more stable series, we used linear interpolation, which fills each missing value along a straight line between the nearest observed values on either side. This made sense because each point in these series never strayed far from the previous one.
# Impute missing values using linear interpolation for subsets
subset_S06_clean_Var07$Var05 <- na.approx(subset_S06_clean_Var07$Var05)
subset_S06_clean_Var07$Var07 <- na.approx(subset_S06_clean_Var07$Var07)
subset_S05_clean_Var02$Var03 <- na.approx(subset_S05_clean_Var02$Var03)
subset_S05_clean_Var02$Var02 <- na.approx(subset_S05_clean_Var02$Var02)
subset_S04_clean_Var01$Var02 <- na.approx(subset_S04_clean_Var01$Var02)
subset_S04_clean_Var01$Var01 <- na.approx(subset_S04_clean_Var01$Var01)
subset_S03_clean_Var07$Var05 <- na.approx(subset_S03_clean_Var07$Var05)
subset_S03_clean_Var07$Var07 <- na.approx(subset_S03_clean_Var07$Var07)
subset_S02_clean_Var02$Var03 <- na.approx(subset_S02_clean_Var02$Var03)
subset_S02_clean_Var02$Var02 <- na.approx(subset_S02_clean_Var02$Var02)
subset_S01_clean_Var01$Var02 <- na.approx(subset_S01_clean_Var01$Var02)
subset_S01_clean_Var01$Var01 <- na.approx(subset_S01_clean_Var01$Var01)
# Find the last observation index for subsets
last_observation_index_S06 <- max(which(!is.na(subset_S06_clean_Var07$Var05)))
last_observation_index_S05 <- max(which(!is.na(subset_S05_clean_Var02$Var03)))
last_observation_index_S04 <- max(which(!is.na(subset_S04_clean_Var01$Var02)))
last_observation_index_S03 <- max(which(!is.na(subset_S03_clean_Var07$Var05)))
last_observation_index_S02 <- max(which(!is.na(subset_S02_clean_Var02$Var03)))
last_observation_index_S01 <- max(which(!is.na(subset_S01_clean_Var01$Var02)))
# Create time series objects for forecasting
ts_S06_Var05 <- ts(subset_S06_clean_Var07$Var05[1:last_observation_index_S06])
ts_S06_Var07 <- ts(subset_S06_clean_Var07$Var07[1:last_observation_index_S06])
ts_S05_Var02 <- ts(subset_S05_clean_Var02$Var02[1:last_observation_index_S05])
ts_S05_Var03 <- ts(subset_S05_clean_Var02$Var03[1:last_observation_index_S05])
ts_S04_Var01 <- ts(subset_S04_clean_Var01$Var01[1:last_observation_index_S04])
ts_S04_Var02 <- ts(subset_S04_clean_Var01$Var02[1:last_observation_index_S04])
ts_S03_Var05 <- ts(subset_S03_clean_Var07$Var05[1:last_observation_index_S03])
ts_S03_Var07 <- ts(subset_S03_clean_Var07$Var07[1:last_observation_index_S03])
ts_S02_Var02 <- ts(subset_S02_clean_Var02$Var02[1:last_observation_index_S02])
ts_S02_Var03 <- ts(subset_S02_clean_Var02$Var03[1:last_observation_index_S02])
ts_S01_Var01 <- ts(subset_S01_clean_Var01$Var01[1:last_observation_index_S01])
ts_S01_Var02 <- ts(subset_S01_clean_Var01$Var02[1:last_observation_index_S01])
# Forecast using auto.arima
forecast_S06_Var05 <- forecast(auto.arima(ts_S06_Var05), h = 140)
forecast_S06_Var07 <- forecast(auto.arima(ts_S06_Var07), h = 140)
forecast_S05_Var02 <- forecast(auto.arima(ts_S05_Var02), h = 140)
forecast_S05_Var03 <- forecast(auto.arima(ts_S05_Var03), h = 140)
forecast_S04_Var01 <- forecast(auto.arima(ts_S04_Var01), h = 140)
forecast_S04_Var02 <- forecast(auto.arima(ts_S04_Var02), h = 140)
forecast_S03_Var05 <- forecast(auto.arima(ts_S03_Var05), h = 140)
forecast_S03_Var07 <- forecast(auto.arima(ts_S03_Var07), h = 140)
forecast_S02_Var02 <- forecast(auto.arima(ts_S02_Var02), h = 140)
forecast_S02_Var03 <- forecast(auto.arima(ts_S02_Var03), h = 140)
forecast_S01_Var01 <- forecast(auto.arima(ts_S01_Var01), h = 140)
forecast_S01_Var02 <- forecast(auto.arima(ts_S01_Var02), h = 140)
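Before trusting the forecasts, it is worth inspecting at least one fitted model. A brief sketch using the `forecast` package diagnostics, shown for S01/Var01 purely as an example:

```r
# Report the ARIMA order and coefficients auto.arima selected
summary(forecast_S01_Var01$model)

# Residual diagnostics: the residuals should resemble white noise
checkresiduals(forecast_S01_Var01$model)
```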
# Create dataframe for forecasts
forecasts_df_S06 <- data.frame(
SeriesInd = (subset_S06_clean_Var07$SeriesInd[last_observation_index_S06] + 1):(subset_S06_clean_Var07$SeriesInd[last_observation_index_S06] + 140),
category = rep("S06", 140),
Var05 = forecast_S06_Var05$mean,
Var07 = forecast_S06_Var07$mean)
forecasts_df_S05 <- data.frame(
SeriesInd = (subset_S05_clean_Var02$SeriesInd[last_observation_index_S05] + 1):(subset_S05_clean_Var02$SeriesInd[last_observation_index_S05] + 140),
category = rep("S05", 140),
Var02 = forecast_S05_Var02$mean,
Var03 = forecast_S05_Var03$mean)
forecasts_df_S04 <- data.frame(
SeriesInd = (subset_S04_clean_Var01$SeriesInd[last_observation_index_S04] + 1):(subset_S04_clean_Var01$SeriesInd[last_observation_index_S04] + 140),
category = rep("S04", 140),
Var01 = forecast_S04_Var01$mean,
Var02 = forecast_S04_Var02$mean)
forecasts_df_S03 <- data.frame(
SeriesInd = (subset_S03_clean_Var07$SeriesInd[last_observation_index_S03] + 1):(subset_S03_clean_Var07$SeriesInd[last_observation_index_S03] + 140),
category = rep("S03", 140),
Var05 = forecast_S03_Var05$mean,
Var07 = forecast_S03_Var07$mean)
forecasts_df_S02 <- data.frame(
SeriesInd = (subset_S02_clean_Var02$SeriesInd[last_observation_index_S02] + 1):(subset_S02_clean_Var02$SeriesInd[last_observation_index_S02] + 140),
category = rep("S02", 140),
Var02 = forecast_S02_Var02$mean,
Var03 = forecast_S02_Var03$mean)
forecasts_df_S01 <- data.frame(
SeriesInd = (subset_S01_clean_Var01$SeriesInd[last_observation_index_S01] + 1):(subset_S01_clean_Var01$SeriesInd[last_observation_index_S01] + 140),
category = rep("S01", 140),
Var01 = forecast_S01_Var01$mean,
Var02 = forecast_S01_Var02$mean)
# Remove the last 140 rows from each subset so the forecasts can be appended
n_rows_S06 <- nrow(subset_S06)
subset_S06 <- subset_S06[1:(n_rows_S06 - 140), ]
n_rows_S05 <- nrow(subset_S05)
subset_S05 <- subset_S05[1:(n_rows_S05 - 140), ]
n_rows_S04 <- nrow(subset_S04)
subset_S04 <- subset_S04[1:(n_rows_S04 - 140), ]
n_rows_S03 <- nrow(subset_S03)
subset_S03 <- subset_S03[1:(n_rows_S03 - 140), ]
n_rows_S02 <- nrow(subset_S02)
subset_S02 <- subset_S02[1:(n_rows_S02 - 140), ]
n_rows_S01 <- nrow(subset_S01)
subset_S01 <- subset_S01[1:(n_rows_S01 - 140), ]
# Combine original and forecasted data
combined_df_S06 <- rbind(subset_S06, forecasts_df_S06)
combined_df_S05 <- rbind(subset_S05, forecasts_df_S05)
combined_df_S04 <- rbind(subset_S04, forecasts_df_S04)
combined_df_S03 <- rbind(subset_S03, forecasts_df_S03)
combined_df_S02 <- rbind(subset_S02, forecasts_df_S02)
combined_df_S01 <- rbind(subset_S01, forecasts_df_S01)
# Add a label column to differentiate original and predicted data points
n_rows_combined_S06 <- nrow(combined_df_S06)
combined_df_S06$label <- "original"
combined_df_S06$label[(n_rows_combined_S06 - 139):n_rows_combined_S06] <- "predicted"
n_rows_combined_S05 <- nrow(combined_df_S05)
combined_df_S05$label <- "original"
combined_df_S05$label[(n_rows_combined_S05 - 139):n_rows_combined_S05] <- "predicted"
n_rows_combined_S04 <- nrow(combined_df_S04)
combined_df_S04$label <- "original"
combined_df_S04$label[(n_rows_combined_S04 - 139):n_rows_combined_S04] <- "predicted"
n_rows_combined_S03 <- nrow(combined_df_S03)
combined_df_S03$label <- "original"
combined_df_S03$label[(n_rows_combined_S03 - 139):n_rows_combined_S03] <- "predicted"
n_rows_combined_S02 <- nrow(combined_df_S02)
combined_df_S02$label <- "original"
combined_df_S02$label[(n_rows_combined_S02 - 139):n_rows_combined_S02] <- "predicted"
n_rows_combined_S01 <- nrow(combined_df_S01)
combined_df_S01$label <- "original"
combined_df_S01$label[(n_rows_combined_S01 - 139):n_rows_combined_S01] <- "predicted"
# Plot Var05 over SeriesInd for subset_S06
plot_S06_Var05 <- ggplot(combined_df_S06, aes(x = SeriesInd, y = Var05, color = label)) +
geom_line() +
labs(title = "Var05 over SeriesInd - S06", x = "SeriesInd", y = "Var05") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S06_Var07 <- ggplot(combined_df_S06, aes(x = SeriesInd, y = Var07, color = label)) +
geom_line() +
labs(title = "Var07 over SeriesInd - S06", x = "SeriesInd", y = "Var07") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S05_Var02 <- ggplot(combined_df_S05, aes(x = SeriesInd, y = Var02, color = label)) +
geom_line() +
labs(title = "Var02 over SeriesInd - S05", x = "SeriesInd", y = "Var02") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S05_Var03 <- ggplot(combined_df_S05, aes(x = SeriesInd, y = Var03, color = label)) +
geom_line() +
labs(title = "Var03 over SeriesInd - S05", x = "SeriesInd", y = "Var03") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S04_Var01 <- ggplot(combined_df_S04, aes(x = SeriesInd, y = Var01, color = label)) +
geom_line() +
labs(title = "Var01 over SeriesInd - S04", x = "SeriesInd", y = "Var01") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S04_Var02 <- ggplot(combined_df_S04, aes(x = SeriesInd, y = Var02, color = label)) +
geom_line() +
labs(title = "Var02 over SeriesInd - S04", x = "SeriesInd", y = "Var02") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S03_Var05 <- ggplot(combined_df_S03, aes(x = SeriesInd, y = Var05, color = label)) +
geom_line() +
labs(title = "Var05 over SeriesInd - S03", x = "SeriesInd", y = "Var05") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S03_Var07 <- ggplot(combined_df_S03, aes(x = SeriesInd, y = Var07, color = label)) +
geom_line() +
labs(title = "Var07 over SeriesInd - S03", x = "SeriesInd", y = "Var07") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S02_Var02 <- ggplot(combined_df_S02, aes(x = SeriesInd, y = Var02, color = label)) +
geom_line() +
labs(title = "Var02 over SeriesInd - S02", x = "SeriesInd", y = "Var02") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S02_Var03 <- ggplot(combined_df_S02, aes(x = SeriesInd, y = Var03, color = label)) +
geom_line() +
labs(title = "Var03 over SeriesInd - S02", x = "SeriesInd", y = "Var03") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S01_Var01 <- ggplot(combined_df_S01, aes(x = SeriesInd, y = Var01, color = label)) +
geom_line() +
labs(title = "Var01 over SeriesInd - S01", x = "SeriesInd", y = "Var01") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()
plot_S01_Var02 <- ggplot(combined_df_S01, aes(x = SeriesInd, y = Var02, color = label)) +
geom_line() +
labs(title = "Var02 over SeriesInd - S01", x = "SeriesInd", y = "Var02") +
scale_color_manual(values = c("original" = "blue", "predicted" = "red")) +
theme_minimal()

In this project we walked through a full time series analysis to understand a de-identified dataset and, ultimately, to generate 140-period forecasts for 12 individual variables across the six categories.